Rational selection of putative peptides from identified nucleotide or peptide sequences

ABSTRACT

The present invention relates to a method for identifying putative peptides from nucleotide or peptide sequences of unknown function such as both nucleic acid and peptide precursors of a peptide comprising an amidated C-terminal end and, more particularly, to a method wherein putative peptide precursors are identified from a genetic database.

PRIOR ART RELEVANT TO THE INVENTION

[0001] (i) Field of the Invention

[0002] The present invention relates to a method for identifying putative peptides from nucleotide or peptide sequences of unknown function such as both nucleic acid and peptide precursors of a peptide comprising an amidated C-terminal end and, more particularly, to a method wherein putative peptide precursors are identified from a genetic database.

[0003] (ii) Description of Related Art

[0004] Certain combinations of nucleotides, when present in a polynucleotide, are known to give rise to certain properties in the polypeptide translated therefrom. One example includes those nucleotides which encode polypeptide hormone precursors which undergo a post-translational amidation reaction. Another example, as disclosed in U.S. Pat. No. 4,917,999, relates to certain nucleotides which are characteristic of polypeptides exhibiting α-amylase enzymatic activity.

[0005] Amidated polypeptide hormones are synthesized in the form of a precursor which undergoes maturation. This maturation consists of an amidation reaction. The amidation reaction of the C-terminal end is a characteristic reaction of amidated polypeptide hormones. This reaction, which occurs on the precursor of one or more hormones, allows maturation of the hormone and also ensures its biostability in the physiological medium: the amide group formed is less vulnerable than the free acid function. The hormone is therefore more resistant to carboxypeptidases, it remains active in the cell for longer and retains an optimum affinity for its receptor site.

[0006] Amidation has been widely described (<<Peptide amidation>>, Alan F. Bradbury and Derek G. Smyth, TIBS 16: 112-115, March 1991 and <<Functional and structural characterization of peptidylamidoglycolate lyase, the enzyme catalysing the second step in peptide amidation>>, A. G. Katopodis, D. S. Ping, C. E. Smith and S. W. May, Biochemistry, 30(25): 6189-6194, June 1991), and its mechanism is as follows:

[0007] 1—Cleavage of the polypeptide precursor chain of the hormone by an endoprotease at a sequence of two basic amino acids (that is to say a sequence of two amino acids chosen from arginine and/or lysine),

[0008] 2—Subsequently, two cleavages by carboxypeptidase, which result in the extended glycine intermediate,

[0009] 3—The enzyme PAM (peptidyl-glycine-α-amidating monooxygenase) comprises two distinct enzymatic activities: firstly, it converts the extended glycine intermediate into an α-hydroxyglycine derivative, the subunit of the enzyme PAM involved being PHM (peptidyl-glycine-α-hydroxylating monooxygenase). The derivative obtained serves as the substrate for the second subunit of PAM (called PAL: peptidyl-α-hydroxyglycine-α-amidating lyase), which fixes the amine function of the glycine on to the amino acid immediately adjacent to the N-terminal side and liberates glyoxylate.

[0010] This reaction involves the presence of a recognition site on the precursor of the hormone or hormones, a site which always comprises the sequence: glycine and two basic amino acids (arginine or lysine). The amidated polypeptide hormones which are secreted outside the endoplasmic reticulum are known to comprise a consensus signal sequence of about fifteen to thirty amino acids, this sequence being present at the N-terminal end of the polypeptide chain. It is cut later by a signal peptidase enzyme in such a way that it is no longer found in the protein once secreted.
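At the peptide level, the recognition site described above (glycine followed by two basic amino acids) can be expressed as a simple pattern over one-letter amino acid codes. The following is a minimal sketch only; the function name and the example peptide are hypothetical and do not come from the cited references.

```python
import re

# Gly (G) followed by two basic residues, each Arg (R) or Lys (K).
# The lookahead lets overlapping sites be reported individually.
RECOGNITION_SITE = re.compile(r"(?=G[KR][KR])")


def find_recognition_sites(peptide):
    """Return the 0-based positions of every Gly-(Arg|Lys)-(Arg|Lys) site."""
    return [m.start() for m in RECOGNITION_SITE.finditer(peptide.upper())]


if __name__ == "__main__":
    # Hypothetical peptide used only to exercise the function.
    print(find_recognition_sites("MKTAYGKRSLLGRRGKK"))
```

A hit only marks a possible amidation site; the signal sequence and the enzymatic context described above are not checked by this sketch.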

[0011] Given the importance of known amidated polypeptides in the context of numerous biological systems, methods have been sought for the identification of additional amidated polypeptides. Unfortunately, at the present time, the discovery of a new protein is not easy.

[0012] To date, the art has developed a certain number of approaches in an attempt to identify novel proteins of potential biological interest.

[0013] In one of these approaches, new proteins of interest are isolated from a source by selecting a specific property which the researcher believes will be possessed by one or more potential proteins of interest in a sample. According to this approach, proteins can be isolated and purified by various techniques: precipitation at the isoelectric point, selective extraction by certain solvents and then purification by crystallization, counter-current distribution, adsorption, partition or ion exchange chromatography, electrophoresis.

[0014] The conventional protein isolation techniques described above provide only limited success in the isolation and identification of new biological molecules of interest. This approach implies knowledge of the properties of the protein to be isolated. Typically, when the isolation of proteins is carried out by the use of a common property, one of the two following situations arises. In the first situation, the common property will be for the most part unrelated or only marginally related to the biological function of the molecules being isolated. One could envision, for example, two proteins sharing identical isoelectric points but having completely unrelated biological functions. In the second situation, separation might be achieved based on a common property which is very closely related to the biological function of the molecule about to be isolated. In this category, for example, one might envision molecules which bind to the same receptor molecule. In the former situation, the isolation of potentially new polypeptides is quite unfocussed given its likelihood of isolating compounds of completely unrelated biological function. By complete contrast, the latter situation suffers the exact opposite deficiency in that it enables isolation of only a very limited number of new biologically interesting molecules.

[0015] Thus, a person skilled in the art seeking to isolate potentially new polypeptides of interest by conventional protein separation techniques was confronted with the constant dilemma of obtaining a hodgepodge of biologically unrelated polypeptides or, alternatively, only a very specific set of polypeptides.

[0016] Another serious shortcoming of conventional techniques for the isolation of new polypeptides from a sample is linked to the nature of the sample itself. Obviously, there will be a limited number of available polypeptides for isolation and identification in any given biological sample. Furthermore, great care must be taken with such samples to ensure the continued integrity of the biologically active molecules therein.

[0017] Not surprisingly, previous attempts to isolate and characterize new peptides comprising an amidated C-terminal end have followed the conventional approach of starting with a biological sample and choosing from the arsenal of known separation techniques for isolating and identifying the peptides. For example, in U.S. Pat. No. 5,360,727 in the name of Matsuo et al., there was isolated a C-terminal alpha-amidating enzyme of porcine origin by extracting and purifying the enzyme from porcine atrium cords exhibiting the enzyme activity. In U.S. Pat. No. 5,871,995 issued in the name of Iida et al., enzymes participating in C-terminal amidation were purified from a biological material such as horse serum by affinity chromatography using a peptide C-terminal glycine adduct as a ligand. In U.S. Pat. No. 4,708,934 in the name of Gilligan et al., peptidyl-glycine alpha-amidating monooxygenase enzyme was extracted from medullary thyroid carcinoma cell lines and tissue samples. Where identification of substantial numbers of new polypeptides capable of amidation is the goal, conventional isolation techniques such as these are completely unsuitable, as they typically permit isolation of only a single polypeptide of interest from a source likely to contain that polypeptide.

[0018] In PCT WO 99/10361, the assignee of the present application has recently developed a method which overcomes many of the disadvantages discussed above in that it enables the rapid identification of a large number of putative peptides which comprise an amidated C-terminal end. In particular, unlike earlier techniques which relied on a particular physical property of the polypeptide to isolate it from a source suspected or known to contain it, the method developed by the assignee relies on a characteristic of the peptide sequence of the precursor of all amidated hormones known to date, thereby allowing simultaneous detection of several new hormones of this category. More particularly, this technique relies on the direct identification of the nucleotide sequence which codes for the precursors in cDNA banks prepared from tissues in which the precursors of these hormones can be synthesized.

[0019] The method of the PCT Application WO 99/10361 permits identification of the precursor of a peptide having an amidated C-terminal end, by the following successive stages:

[0020] 1—Obtaining a DNA bank

[0021] 2—Hybridization of one or more oligonucleotides OX with the DNA bank;

[0022] 3—Identification of the DNA sequence or sequences of the bank which hybridizes with an oligonucleotide OX;

[0023] 4—Identification in this sequence or sequences of one or more peptides with a possible amidated C-terminal end.

[0024] OX is a single-stranded oligonucleotide which can hybridize under mild conditions with an oligonucleotide OY of the sequence Y1-Y2-Y3-Y4-Y5, in which Y1 represents a nucleotide sequence of 1 to 12 nucleotides or Y1 is suppressed, Y2 represents a trinucleotide which codes for Gly, Y3 and Y4 independently represent a trinucleotide which codes for Arg or Lys and Y5 represents a nucleotide sequence of 1 to 21 nucleotides or Y5 is suppressed.
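The core of OY (Y2-Y3-Y4) can be enumerated from the standard genetic code: four codons for Gly, and eight codons for a basic residue (six for Arg, two for Lys), giving 4 × 8 × 8 = 256 possible nine-nucleotide cores. The sketch below lists them together with their exact reverse complements. Treating the exact reverse complement as a candidate OX is an assumption made for illustration; hybridization under mild conditions also tolerates imperfect matches, and the optional flanks Y1 and Y5 are omitted. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, DNA alphabet.
GLY_CODONS = ["GG" + b for b in "ACGT"]                      # 4 codons
ARG_CODONS = ["CG" + b for b in "ACGT"] + ["AGA", "AGG"]     # 6 codons
LYS_CODONS = ["AAA", "AAG"]                                  # 2 codons
BASIC_CODONS = ARG_CODONS + LYS_CODONS                       # 8 codons


def core_sequences():
    """Every Y2-Y3-Y4 core: a Gly codon followed by two Arg/Lys codons."""
    return ["".join(t) for t in product(GLY_CODONS, BASIC_CODONS, BASIC_CODONS)]


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def candidate_probes():
    """Exact reverse complements of the cores (illustrative OX candidates)."""
    return [reverse_complement(core) for core in core_sequences()]


if __name__ == "__main__":
    print(len(core_sequences()))
```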

[0025] Preferably, the DNA bank is a cDNA bank. A cDNA bank contains the cDNA corresponding to the cytoplasmic mRNA extracted from a given cell. The bank is said to be complete if it comprises at least one bacterial clone for each starting mRNA.

[0026] Hybridization takes place if two oligonucleotides have substantially complementary nucleotide sequences, and they can combine over their length by establishing hydrogen bonds between complementary bases.

[0027] Searching with the method of PCT Application WO 99/10361 has been found to be much less restricting than the above mentioned conventional techniques of biochemistry, since:

[0028] it can lead to the isolation of several distinct precursors present in the same tissue by the same principle;

[0029] it allows detection, under the same technical conditions, of precursors corresponding to hormones which have very different biochemical and biological properties;

[0030] it allows concomitant identification of all the peptide hormones which can be contained in the same precursor.

[0031] As a result, the screening technique set forth in PCT Application WO 99/10361 allows a not insignificant saving in time and money in a sector where research and development costs represent a very high proportion of turnover.

[0032] By allowing the obtaining of a large number of potentially therapeutically useful polypeptides, the developed technique allows the pharmacological study of active substances having a fundamental physiological role in the mammalian organism: hormones and more particularly amidated polypeptide neurohormones. Given the availability for the first time of cDNA corresponding to active substances, it is now possible to introduce a cloned vector by genetic engineering to lead to synthesis of hormones having a therapeutic use by means of microorganisms.

[0033] Although giving rise to numerous significant advantages in terms of the ability to rapidly obtain a large number of putative candidate peptide molecules which serve as precursors to peptides comprising an amidated C-terminal end, there are nonetheless still certain difficulties linked to the use of a cDNA bank for carrying out the selection by screening. As discussed earlier, the cDNA bank is typically derived from a single cell and consequently will contain only those polypeptides which are expressed in that cell. This means that even if it is present within the genome of the cell, the screening method will not detect a putative peptide if that peptide is not expressed in that cell. Furthermore, even to the extent that a polypeptide of interest is expressed in the cell, the screening technique is necessarily limited to polypeptides expressed by that particular cell and, indeed, by the particular species of life from which the cell is derived. This makes it difficult to screen for the vast numbers of putative peptides which are of interest.

[0034] Thus, the method of PCT Application WO 99/10361 solved a very important restriction in the identification of putative peptides serving as precursors of peptides comprising an amidated C-terminal end. More specifically, while the method of PCT Application WO 99/10361 certainly provides the means through which a cDNA bank can be probed for all possible sequences having the desired post-translational amidation property, its use was nonetheless restricted to the cDNA present in the cDNA bank.

[0035] Recently, there has been an interest in using available databases containing vast numbers of nucleotide sequences in order to, for example, compare a sequence of interest with known sequences.

[0036] For example, in U.S. Pat. No. 5,706,498, a gene database retrieval system is disclosed for obtaining a gene sequence having a sequence similar to sequence data from a gene database. The gene database stores the sequence data of genes whose structures or sequences have already been analyzed and identified. The system includes a dynamic programming operation unit for determining the degree of similarity between target data and key data by utilizing the sequence data of the bases of the gene from the gene database as target data and the sequence data of the bases as the key for retrieval, and a central processing device unit to carry out the process of accessing the gene database, parallel to the operation process for determining the degree of similarity by transmitting the sequence data of the bases from the gene database continually one after another into the dynamic programming operation unit as the target data, by controlling the gene database and the dynamic programming operation unit.

[0037] U.S. Pat. No. 5,873,082 discloses a method for searching a gene database characterised in that a homologous sequence of a given sequence is the object of a search in the gene database and in that the results are formulated in the form of results ordered in a sequence of increasing homology. According to the patent, a plurality of lists having similarities and differences can be effectively compared. In the case of the gene database search results, a large number of lists including a huge number of sequence names can be quickly compared.

[0038] U.S. Pat. No. 5,577,249 discloses a method for finding a reference sequence in a database. The most preferred embodiment has a specific application in searching the genome of living organisms, in particular the human genome, to find locations and purposes of nucleotide sequences and other biological information that are found on sequences of DNA. The method employs human genome databases commercially available which have subsequences of the DNA chains broken down into nucleotide token sequences. A unique original index associated with the original DNA string is then created. A reference nucleotide sequence is selected. The reference indexes and original indexes are compared. The method was applied to match reference sequences of nucleotides for the genome of E. coli which contains approximately 4 million nucleotides.

[0039] U.S. Pat. No. 5,701,256 discloses a method and apparatus intended for comparisons of sequences characterised in that new protein sequences are compared with known sequences, for example from a sequence database, typically with a view to determine what level of similarity is shared by the proteins in terms of structural and functional characteristics.

[0040] U.S. Pat. No. 5,523,208 discloses a method for screening nucleotide or DNA sequence databases to identify genetic regions or genes coding for biologically interacting proteins. In particular, the method provides a means of screening databases consisting of cloned genetic material, including but not limited to DNA, RNA, mRNA, tRNA and nucleotide fragments, to identify the genetic function of genetic material of unknown function. The method, when used on DNA fragments of unknown coding potential will produce a list of gene fragments which code for proteins having the potential to form complexes or multimeric configurations with the unknown protein.

[0041] As the above discussion demonstrates, one of the major applications for computerized methods of searching databases of genetic information is the comparison of a newly found sequence of unknown function with a database of sequences of known function. The goal of such methods is of course to ascertain the function of the unknown protein by relating it to structurally similar proteins of known function. Another application of computerized methods is to find those sequences in a database which are structurally related. Still further methods try to find sequences which form complexes with a known sequence. The focus of all of these methods is generally the comparison of one sequence with another sequence with a view to determining which sequences are structurally similar, rather than determining which genes, from among a large database of genes, possess a given biological property even where such genes are not strictly structurally similar.

SUMMARY AND OBJECTS OF THE INVENTION

[0042] A subject of the present invention is overcoming the disadvantages relating to techniques of the prior art for identifying biologically interesting molecules from among a large group of candidate molecules. This objective is achieved by providing a method for identifying putative peptides of a given function from among nucleotide or peptide sequences of unknown function by screening a database for the presence of a particular combination of nucleotides or amino acids indicative of peptides of given function.

[0043] Another subject of the present invention is to provide a method for identifying a putative peptide of a given function which is not limited to those peptides expressed in a particular biological source, and which therefore does not fail merely because the protein is not expressed in that source.

[0044] Another subject of the present invention is to provide a method for identifying a putative peptide of a given function which does not depend on physical properties of the peptide, such as the isoelectric point or solubility, for identification.

[0045] Another subject of the present invention is to provide a method for identifying a putative peptide of a given function which is applicable to proteins which are otherwise unrelated in their physical properties.

[0046] Yet another subject of the present invention is to provide a method for identifying a putative peptide of a given function from among candidate polypeptides which exhibit a very low degree of homology with each other.

[0047] Another subject of the present invention is to provide a method for identifying a putative peptide of a given function which facilitates pharmacological study of active substances having a fundamental physiological role in an organism such as hormones and, more particularly, amidated polypeptide hormones.

[0048] Yet another subject of the present invention is to provide a method for identifying a putative peptide of a given function which can be carried out with available genetic databases and available software.

[0049] Briefly described, these and other subjects of the invention are achieved by providing a method for identifying putative peptides of a given function from among nucleotide or peptide sequences of unknown function comprising the following steps:

[0050] (i) obtaining a polynucleotide or polypeptide database;

[0051] (ii) screening said database for the presence of a combination of nucleotides or amino acids indicative of the peptide of given function;

[0052] (iii) identification of the polynucleotide or polypeptide sequences which comprise the combination of nucleotides or amino acids indicative of the peptide of given function.

[0053] In a preferred aspect, the present invention provides a method for identifying a precursor of a peptide comprising an amidated C-terminal end comprising the following steps:

[0054] (i) obtaining a polynucleotide or polypeptide database;

[0055] (ii) screening the database for the presence of a combination of nucleotides or amino acids indicative of the precursor of the peptide comprising the amidated C-terminal end;

[0056] (iii) identification of the polynucleotide or polypeptide sequences which comprise the combination of nucleotides or amino acids indicative of the precursor of the peptide comprising the amidated C-terminal end.

[0057] The database preferably comprises polynucleotide sequences and/or polypeptide sequences corresponding to the polynucleotide sequences and, optionally, accession numbers for the polynucleotide sequences. Where the polypeptide sequence or sequences are not available in the database, they may be obtained by translating the polynucleotide sequences in said database. In a preferred embodiment, three different polypeptide sequences are obtained, corresponding to translation of three different reading frames of said polynucleotide sequences.

[0058] The database may further include annotational information relating to the polypeptide or polynucleotide sequences, such as at least one piece of information chosen from the origin, the source, the characteristics and the references for the sequences.

[0059] In screening the database, it is preferable to first locate the AUG start codon in a polynucleotide sequence and verify that no stop codon is present between the AUG start codon and the combination of nucleotides indicative of the precursor of the peptide comprising the amidated C-terminal end.
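The check described in this paragraph can be sketched as follows: each start codon is located, codons are read in frame until the first stop codon, and a Gly codon followed by two Arg/Lys codons is reported only if it occurs before that stop. This is a minimal illustration under the standard genetic code; the function name and the example sequences are hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}
GLY = re.compile(r"GG[ACGT]")
BASIC = re.compile(r"CG[ACGT]|AG[AG]|AA[AG]")  # Arg or Lys codons


def in_frame_sites(seq):
    """Return (start_codon_pos, site_pos) pairs, 0-based, for every
    Gly-basic-basic codon triplet reached from an ATG without a stop."""
    seq = seq.upper().replace("U", "T")  # accept RNA (AUG) input as well
    results = []
    for m in re.finditer(r"(?=ATG)", seq):
        start = m.start()
        codons = []
        for i in range(start, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                break  # open reading frame ends here
            codons.append(codon)
        for k in range(len(codons) - 2):
            if (GLY.fullmatch(codons[k])
                    and BASIC.fullmatch(codons[k + 1])
                    and BASIC.fullmatch(codons[k + 2])):
                results.append((start, start + 3 * k))
    return results


if __name__ == "__main__":
    print(in_frame_sites("ATGTTTGGCAAACGTTAA"))
```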

[0060] Once the step of identifying nucleotide or polypeptide sequences has been carried out, the identified sequences can be compared with sequences of known biological function and with identified sequences whose biological function is unknown. Alternatively, after the step of identifying nucleotide or polypeptide sequences has been carried out, the similarity of the selected sequences of unknown biological activity can be compared with sequences of known function and, if no similar sequence is found, the sequence of unknown biological activity can be selected for further investigation. If a similar sequence is found, the sequence of unknown biological activity can be selected as a candidate sequence exhibiting the putative function of the known similar sequence.
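The second alternative above is a two-way decision, which can be sketched as follows. The similarity measure here, `difflib.SequenceMatcher`, is only a stand-in chosen to keep the example self-contained; it is not the comparison tool of the invention, and the 0.8 threshold, the function name and the example sequences are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def triage(candidate, known_sequences, threshold=0.8):
    """Classify a candidate of unknown activity against sequences of
    known function. `known_sequences` maps a function label to a sequence."""
    best_label, best_score = None, 0.0
    for label, seq in known_sequences.items():
        score = SequenceMatcher(None, candidate, seq, autojunk=False).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        # A similar known sequence exists: adopt its putative function.
        return ("candidate_for_known_function", best_label)
    # Nothing similar: keep the sequence for further investigation.
    return ("select_for_further_investigation", None)


if __name__ == "__main__":
    known = {"hypothetical hormone A": "MKTAYGKRSLL"}
    print(triage("MKTAYGKRSLL", known))
    print(triage("WWWWWWWW", known))
```

In practice an alignment-based score from an established sequence comparison program would take the place of the stand-in ratio.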

[0061] In one embodiment, the identified polypeptide sequence can be obtained and the properties of the polypeptide sequence can be evaluated.

[0062] The combination of nucleotides comprises the sequence Y1-Y2-Y3-Y4-Y5, in which Y1 is a nucleotide sequence of 1 to 12 nucleotides or is suppressed, Y2 is a codon for Gly, Y3 and Y4 independently are codons for Arg or Lys and Y5 is a nucleotide sequence of 1 to 21 nucleotides or Y5 is suppressed.
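Because Y1 and Y5 are optional flanks of unspecified content, a search for this combination reduces to locating the Y2-Y3-Y4 core and then reading off up to 12 nucleotides upstream and up to 21 nucleotides downstream. The sketch below does this in any reading frame, assuming the standard genetic code; the function name, the dictionary keys and the example sequence are hypothetical.

```python
import re

BASIC_CODON = r"(?:CG[ACGT]|AG[AG]|AA[AG])"            # Arg or Lys
CORE = re.compile(r"(?=(GG[ACGT]" + BASIC_CODON + BASIC_CODON + r"))")


def find_combinations(seq):
    """Locate each Y2-Y3-Y4 core and return it with its Y1 and Y5 flanks."""
    seq = seq.upper().replace("U", "T")
    hits = []
    for m in CORE.finditer(seq):
        start = m.start()
        hits.append({
            "start": start,                              # 0-based core position
            "Y1": seq[max(0, start - 12):start],         # up to 12 nt upstream
            "core": m.group(1),                          # Y2-Y3-Y4, 9 nt
            "Y5": seq[start + 9:start + 9 + 21],         # up to 21 nt downstream
        })
    return hits


if __name__ == "__main__":
    for hit in find_combinations("atgtttGGCAAACGTtaa"):
        print(hit)
```

The sketch does not check the reading frame; a hit is meaningful only if the core lies in the frame that is actually translated.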

[0063] By reference to the preceding text as well as other objects, features and advantages of the invention that will become hereinafter apparent, the nature of the invention may be better understood by reference to the detailed description of the preferred embodiments and to the appended claims.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0064] By <<putative peptides of a given function>> is meant polypeptides from a particular oligonucleotide sequence which is characteristic of a particular function shared by a large number of proteins among a single species and/or among different species. In particular, certain oligonucleotide sequences have been found to be associated with certain types of proteins, such as C-terminal amidated hormones or amylases. The invention is applicable wherever there is an oligonucleotide sequence indicative of such a function.

[0065] By <<precursor of a peptide comprising an amidated C-terminal end>> is meant any of the precursor proteins which undergo an amidation reaction at the C-terminal end as described, for example, in Bradbury et al.

[0066] By <<polynucleotide or polypeptide database>> is meant any of the publicly available databases, such as FASTA, GENBANK, EMBL, SP-TREMBL, PIR, REBASE, PROSITE or SWISS-PROT which typically include polynucleotide and/or polypeptide sequence data. The nucleotide sequence data is available, for example, at the EMBL, GENBANK organisations and at other places such as EXPASY. Such data will also often include ACCESSION numbers.

[0067] If the peptide sequences corresponding to a given ACCESSION number are available (for example in SWISS-PROT), then this <<validated>> sequence is included in the database. Otherwise, the nucleotide sequence is preferably translated using programs available in the art such as Translate. The translation is carried out using the data associated with the nucleotide sequence (3′ or 5′ end) and three different reading frames (N, N+1, N+2). If this information is not available, the translation is carried out using both the available nucleotide sequence and its complementary sequence (six putative peptides for a single nucleotide sequence).
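The translation step described above can be sketched with the standard genetic code: three reading frames on the given strand and, when the orientation is unknown, three more on the complementary strand. This is an illustrative sketch and not the Translate program itself. The frame labels (`+0` to `-2`) and the use of `X` for codons containing ambiguous bases are assumptions.

```python
from itertools import product

# Standard genetic code in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement; characters other than A, C, G, T are kept as-is."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, frame=0):
    """Translate one reading frame (0, 1 or 2); unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame(seq):
    """Six putative peptides: three frames on each strand."""
    seq = seq.upper().replace("U", "T")
    rc = reverse_complement(seq)
    frames = {}
    for f in range(3):
        frames["+%d" % f] = translate(seq, f)
        frames["-%d" % f] = translate(rc, f)
    return frames


if __name__ == "__main__":
    for label, peptide in six_frame("ATGGGCAAACGTTAA").items():
        print(label, peptide)
```

When the 5′ or 3′ orientation is known, only the three frames of the appropriate strand need to be kept.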

[0068] Optionally, such database further includes annotations containing all of the information that is available for a particular sequence, such as ORIGIN, SOURCE, FEATURES, related REFERENCES and COMMENTS associated with the sequence. This facilitates an additional extraction from the database. An example of information available for annotation is given below:

LOCUS       HUMXT00347 239 bp mRNA EST 24-JUN-1992
DEFINITION  Human expressed sequence tag (EST00347 similar to Repeat: CT), mRNA sequence.
ACCESSION   M62275
NID         g340398
KEYWORDS    EST; expressed sequence tag.
SOURCE      Homo sapiens (library: Stratagene catalog #936205) female 2 yr old Hippocampus cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata; Eutheria; Primates Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 239)
  AUTHORS   Adams, M. D., Kelley, J. M., Gocayne, J. D., Dubnick, M., Polymeropoulos, M. H., Xiao, H., Wu, A., Olde, B., Moreno, R. F., Kerlavage, A. R., McCombie, W. R. and Venter, J. C.
  TITLE     Complementary DNA sequencing: Expressed sequence tags and human genome project
  JOURNAL   Science 252, 1651-1656 (1991)
  MEDLINE   91262645
FEATURES    Location/Qualifiers
     source 1..239
            /organism="Homo sapiens"
            /db_xref="taxon:9606"
            /dev_stage="2 yr old"
            /sex="female"
            /tissue_type="Hippocampus"
            /tissue_lib="Stratagene catalog #936205"
BASE COUNT  36 a 72 c 33 g 97 t 1 others
ORIGIN
    1 ttcatgctca tgtaaccttc ttaatagtgc cttgtctgct gggtttgtag ctgtaagagt
   61 tdgcaaact ggccctataa aaatattgat gctgtccatt aaaatgaatc tctctctctc
  121 actcagtctc tctctctgtc tgtctctctt tcttctctct cctgccatgt gtgtgtctct
  181 ctctactcct ctgattttgn cctctctctc tattctgcta ctctctctcc tctcctccg

[0069] By <<combination of nucleotides or amino acids indicative of the precursor of the peptide comprising the amidated C-terminal end>> is preferably meant the sequence Y1-Y2-Y3-Y4-Y5, in which Y1 is a nucleotide sequence of 1 to 12 nucleotides or is suppressed, Y2 is a codon for Gly, Y3 and Y4 independently are codons for Arg or Lys and Y5 is a nucleotide sequence of 1 to 21 nucleotides or Y5 is suppressed. The particular combination is set forth in detail in the PCT Application WO 99/10361, the disclosure of which is hereby incorporated by way of reference.

[0070] In the preferred embodiment, the database to be screened is obtained by running a script very similar to the sql/plus script which follows:

-- ============================================
-- Name of the base: TGEN
-- Name of SGBD : ORACLE Version 7.0
-- ============================================

-- ============================================
-- Table: GBEST
-- ============================================
drop table GBEST;

create table GBEST
(
    SODEFINITION     VARCHAR2 (500),
    SOACCESSION      VARCHAR2 (121) not null,
    SOORGANISM       VARCHAR2 (500),
    SOSEQUENCE       LONG
)
storage (initial 300M next 10M pctincrease 0);

-- ============================================
-- Table: NUCSEQENV
-- ============================================
drop table NUCSEQENV;

create table NUCSEQENV
(
    SOLOCUS          VARCHAR2 (121),
    SOACCESSION      VARCHAR2 (121) not null,
    SONID            VARCHAR2 (121),
    SOSOURCE         VARCHAR2 (2000),
    SOREFERENCE      VARCHAR2 (2000),
    SOCOMMENT        VARCHAR2 (2000),
    SOFEATURES       VARCHAR2 (2000),
    SOBASECOUNT      VARCHAR2 (121)
)
storage (initial 300M next 10M pctincrease 0);

-- ============================================
-- Table: PEPSEQ
-- ============================================
drop table PEPSEQ;

create table PEPSEQ
(
    SOACCESSION      VARCHAR2 (121) not null,
    SOPHASE          VARCHAR2 (1),
    SOPEPTIDE        VARCHAR2 (2000),
    SOPEPDEFINITION  VARCHAR2 (500),
    SOPEPTIDEORIGIN  VARCHAR2 (121)
)
storage (initial 300M next 10M pctincrease 0);

-- ============================================
-- Index: ALL Indexes
-- ============================================
drop index GBEST_PK;
drop index NUCSEQENV_PK;
drop index PEPSEQ_PK;

create index GBEST_PK on GBEST (SOACCESSION asc) tablespace GENOMX;
create index NUCSEQENV_PK on NUCSEQENV (SOACCESSION asc) tablespace GENOMX;
create index PEPSEQ_PK on PEPSEQ (SOACCESSION asc) tablespace GENOMX;

[0071] It will be apparent that the number and the size of the different ORACLE fields can be adjusted in order to fit the amount and the size of the data that is being imported into the tables. If another type of information is added to the public databases, a new field is created correspondingly in ORACLE.

[0072] For example, the SOLOCUS, SODEFINITION, SOACCESSION, SOORGANISM, SONID, SOKEYWORDS, SOSOURCE, SOREFERENCE, SOCOMMENT, SOFEATURES, SOBASECOUNT, SOSEQUENCE fields in ORACLE contain the LOCUS, DEFINITION, ACCESSION, ORGANISM, NID, KEYWORDS, SOURCE, REFERENCE, COMMENT, FEATURES, BASECOUNT, SEQUENCE entries respectively of the genetic database files.

[0073] SOPHASE contains the reading frame (0, +1, +2) if the peptide sequence is generated using IHB's DNA Translator. SOPEPTIDE contains the peptide sequence corresponding to a given SOACCESSION. Finally, SOPEPTIDEORIGIN indicates how the peptide sequence was obtained (IHB's DNA Translator, GENBANK, SWISS-PROT, etc.).
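Once peptide sequences are stored in a PEPSEQ-style table, the screening step amounts to a query over the SOPEPTIDE field. The sketch below uses an in-memory SQLite database purely as a self-contained stand-in for ORACLE. The column types, the registered `REGEXP` function, the helper names and the accession values are assumptions made for illustration.

```python
import re
import sqlite3


def _regexp(pattern, value):
    """Backs the SQL operator `value REGEXP pattern` in SQLite."""
    return 1 if value is not None and re.search(pattern, value) else 0


def build_demo_db():
    """Create an in-memory table mirroring the PEPSEQ field names."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("REGEXP", 2, _regexp)
    conn.execute(
        "CREATE TABLE PEPSEQ ("
        " SOACCESSION TEXT NOT NULL,"
        " SOPHASE TEXT,"
        " SOPEPTIDE TEXT,"
        " SOPEPDEFINITION TEXT,"
        " SOPEPTIDEORIGIN TEXT)"
    )
    return conn


def screen_pepseq(conn, pattern=r"G[KR][KR]"):
    """Return (accession, phase) for every peptide containing the pattern."""
    cursor = conn.execute(
        "SELECT SOACCESSION, SOPHASE FROM PEPSEQ"
        " WHERE SOPEPTIDE REGEXP ?"
        " ORDER BY SOACCESSION, SOPHASE",
        (pattern,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    conn.executemany(
        "INSERT INTO PEPSEQ VALUES (?, ?, ?, ?, ?)",
        [
            ("DEMO0001", "0", "MKTAYGKRSLL", "demo entry", "IHB's DNA Translator"),
            ("DEMO0002", "1", "MAAAAGAAA", "demo entry", "SWISS-PROT"),
        ],
    )
    print(screen_pepseq(conn))
```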

[0074] If a FASTA formatted file (see below) is used instead of a GENBANK formatted file, the header line is stored in the SODEFINITION field when we are dealing with a nucleotide sequence and the nucleotide sequence itself is stored in the SOSEQUENCE field. If the data being inserted in the database is a peptide sequence, then the header line is stored in SOPEPDEFINITION and the peptide sequence is stored in the SOPEPTIDE field. Once again, SOPEPTIDEORIGIN indicates where the peptide sequence comes from (GENBANK, SWISS-PROT, etc.).

[0075] >gi|1402336 (M17352) dnaN protein [Salmonella typhimurium]

[0076] MKFTVEREHLLKPLQQVSGPLGGRPTLPILGNLLLQVADGALSLTGTDLEMEMVARVTLSQP

[0077] >gi|306148 (L19604) core polypeptide [Heliobacillus mobilis]

[0078] MATADAAFNPRAQVFEWFKDKVPATRGAVLKAHINHLGMVAGFVSFVLVHHLSWLSDQVLFAPTPIFYAR

[0079] LYQLGLDASARSADALMVARLHLPAAIIFWIIGHIKTPREDEFLKNVTFGKTLVAQFHFLALVATLWGMH

[0080] MAYIGVRGANGGIVPTGLSFDMFGPITGATLAGNHVAFGALLFLGGVFHHFAGFNTKRFAFFEKDWEAVL

[0081] SVSAQVLAFHFATVVFAMIIWNRPDQPILSFYFMQDYALSNYAAPEIREIASQNPGFLIKQVILGHLVFG

[0082] VMFWIGGVFHGASLHVRATNDPKLAEALKDFKMLKRCYDHDFQKKFLALIMFGAFLPIFVSYGIATHNTI

[0083] SDLHHLAKAGMFANMTYINIGTPLHDAIFGSHGTVSDFVAAHAIAGGLHFTMVPLWRMVFFSKVSPWTTK

[0084] VGMKAKRDGEFPCLGPAYGGTCSISLVDQFYLAIFFSLQVIAPAWFYLDGCWMGSFVATSSEVYKQAAEL

[0085] FKANPTWFSLHAVSNFTSEVTSATSSLKPLVCSNTTMVTWFKPCWAAHFIWAFTFSMLFQYRGSRDEGAM

[0086] VLKWAHEQVGLGFAGKVYNRALSLKEGKAIGTFLFFKMTVLCMWCLAMV

[0087] >gi|153320 (L05390) hydroxylase [Streptomyces halstedii]

[0088] MNARADRAGDTVHRVPVLWGGSLVGLSTSVFLGRLGVRHMLVERHAGTSVHPRGRGN NVRTMEVYRAAG VEQGIX.
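The mapping of a FASTA file onto the database fields described above can be sketched as follows. This is a minimal illustration in Python, not the tool used by the inventors: the function names `parse_fasta` and `to_pepseq_row` are hypothetical, and the assumption that the accession number is the parenthesised token in the header (e.g. `(L19604)`) is inferred only from the example headers shown above.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    A record starts at a line beginning with '>'; all following lines up to
    the next '>' are joined into one sequence (whitespace removed).
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.replace(" ", ""))
    if header is not None:
        yield header, "".join(chunks)


def to_pepseq_row(header, sequence, origin):
    """Map one peptide FASTA record onto the PEPSEQ field names.

    Assumption: the accession is the parenthesised token in the header.
    SOACCESSION is None when no such token is present.
    """
    match = re.search(r"\(([A-Za-z]+\d+)\)", header)
    return {
        "SOACCESSION": match.group(1) if match else None,
        "SOPEPDEFINITION": header,
        "SOPEPTIDE": sequence,
        "SOPEPTIDEORIGIN": origin,
    }
```

For a nucleotide FASTA file, the same parser could be used with the header going to SODEFINITION and the sequence to SOSEQUENCE instead.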

[0089] During compilation of the database for screening, the standard file formats used by "High Throughput Sequencing" programs and other "Genome" programs (e.g., FASTA, GENBANK) are read and parsed. The portions of the data which are relevant for the screening step, e.g., the annotation data, are then identified, and the resulting fields are inserted into ORACLE. If the nucleotide sequence is a 5′ end, the sequence is translated directly into a peptide sequence; if it is a 3′ end, the complementary sequence of the given sequence is generated and used for translation. In both cases, the translation takes the three possible reading frames into account to generate the peptide sequences. The last step involves the prediction of the secondary structure of the peptide, which can be based on information theory, as in the programme developed by J. Garnier, D. Osguthorpe, and B. Robson. That software uses all possible pair frequencies within a window of 17 amino acid residues. After cross-validation on a database of 267 proteins, the prediction has a mean accuracy of 64.4% for a three-state prediction (helix, beta strand, and coil). The program produces two outputs, one giving the sequence and the predicted secondary structure, the other giving the probability values for each secondary structure at each amino acid position. The predicted secondary structure is the one of highest probability compatible with a helix segment of at least four residues and an extended segment (beta strand) of at least two residues.
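The translation step can be illustrated with the short Python sketch below. It is not IHB's DNA Translator; it assumes the standard genetic code, interprets "complementary sequence" as the reverse complement, and writes stop codons as `*` and unrecognised codons as `X`. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one reading frame (0, 1 or 2); '*' marks a stop codon."""
    peptide = []
    for i in range(frame, len(dna) - 2, 3):
        peptide.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(peptide)


def three_frame_translation(sequence, is_three_prime_end=False):
    """Return {frame: peptide} for the three reading frames.

    A 3' end read is reverse-complemented first (an assumption about what
    the text calls the complementary sequence).
    """
    dna = sequence.upper().replace("U", "T")
    if is_three_prime_end:
        dna = reverse_complement(dna)
    return {frame: translate(dna, frame) for frame in (0, 1, 2)}
```

Each of the three peptides would then be stored in SOPEPTIDE together with its frame in SOPHASE.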

[0090] Once the relevant data has been assembled in the database, the database can be screened for the presence of a combination of nucleotides or amino acids indicative of the precursor of the peptide, preferably one comprising the amidated C-terminal end. The purpose of this step is to convert the huge amount of unsorted data into a limited set of putative peptides of pharmaceutical interest (potential hormones or hormone fragments, endogenous receptor ligands and the like). The general description of the process is followed by an application to a search for potential amidated peptides in the Expressed Sequence Tags database.

[0091] The general process can, in particular, proceed as follows:

[0092] Extraction of automated sequences from a public source, such as the internet (EMBL, GENBANK, SWISS-PROT, PDB, etc.)

[0093] Analyze file and import data into ORACLE

[0094] Select sequences from a subset (e.g. GENBANK EST) or from the entire database

[0095] Search for all of the sequences exhibiting a specific motif of interest, such as precursors of a peptide comprising an amidated C-terminal end

[0096] Verification that no STOP codon is present between the AUG codon indicating the beginning of the reading frame and the sought motif

[0097] Selection of the sequences of unknown biological function

[0098] Verification that, when found, the motif lies within an open reading frame (Kozak consensus sequence)

[0099] Comparison of the environment of the motif location with the one required (e.g., secondary structures required around a maturation site such as the proximity of alpha-helices or beta-sheets).

[0100] Verification for similarity of the sequences and other known sequences (DEFINITION field)

[0101] Use of successive threading techniques to search sequences displaying a similar secondary structure if no similar structures are defined in the database

[0102] If no similar sequence is found, selection of the sequence as a synthetic candidate whose function has to be determined

[0103] If similar sequences of known function are found, the sequences can be selected as a synthetic candidate whose putative function is that of the similar sequence.
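Two of the steps listed above, the motif search and the STOP codon check, can be sketched in Python as follows. This is only an illustration under stated assumptions: it works on a translated peptide in which `*` denotes a stop codon, it takes the motif to be Gly followed by two residues that are each Arg or Lys (as in the Y2-Y3-Y4 definition of claim 17), and the name `find_candidate_sites` is hypothetical. The Kozak, secondary structure and similarity checks are not covered.

```python
import re

# Gly followed by two basic residues (Lys or Arg), per the Y2-Y3-Y4 definition.
AMIDATION_MOTIF = re.compile(r"G[KR][KR]")


def find_candidate_sites(peptide):
    """Return (met_index, motif_index) pairs for a translated reading frame.

    A motif is kept only if a Met (translated AUG) lies upstream of it with
    no stop codon ('*') between that Met and the motif.
    """
    hits = []
    for match in AMIDATION_MOTIF.finditer(peptide):
        upstream = peptide[:match.start()]
        last_stop = upstream.rfind("*")           # -1 when there is no stop
        met = upstream.find("M", last_stop + 1)   # first Met after that stop
        if met != -1:
            hits.append((met, match.start()))
    return hits
```

A sequence for which this function returns at least one pair would proceed to the remaining checks; an empty list means that no motif in that frame is preceded by an uninterrupted reading frame.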

[0104] The following examples are given by way of illustration and should in no way be interpreted as limiting the subject matter disclosed and claimed.

EXAMPLES

[0105] Using the method described above, the following peptides were identified and tested:

Number  Sequence                  Origin
1       H-GQDSIEPVPGQK-NH2        DbEST
2       H-YARVQWA-NH2             pre-pro-bradykinin
3       H-YFKIDNVKKARVQWA-NH2     pre-pro-bradykinin
4       H-PLEPSGG-NH2             HYAA gene
5       H-ELGRGPGPPLPERGA-NH2     HYAA22 gene
6       H-YERNRQAAAANPENSRGK-NH2  Neurotrophic factor derived from the glial cell line
7       H-SLLSKVSQ-NH2            Protein repairing DNA (XP-C cells)
8       H-VTQDPKLQM-NH2           CD4 glycoprotein precursor (surface T cells)
9       H-DFPEEVAIVEEL-NH2        Glucagon precursor
10      H-MDVGGLSDPYVKVHLLQG-NH2  Synaptotagmin V
11      H-DPSLPVASSSSSSSKR-NH2    Protein repairing DNA and complementing the XP-C cells
12      H-PLLGSTLFIPI-NH2         D-β-hydroxybutyrate dehydrogenase precursor
13      H-DSGFQMNQLR-NH2          Serotransferrin precursor (siderophilin)
14      H-LQLEETMPSPY-NH2         DNA topoisomerase II β isozyme
15      H-FSIATLRDFGV-NH2
16      H-FSVTTMRDFGM-NH2         Cytochrome P450 (inducible with phenobarbital)
17      H-DSSHAFTLDELR-NH2        Melanotransferrin precursor
18      H-DMVVFLDGGQLGTLV-NH2
19      H-LHISHDMTGPD-NH2
20      H-SLEGIFDDIVPD-NH2        T-lymphoma invasion and metastasis inducing protein 1
21      H-SNYFMPFSA-NH2           Cytochrome P450 (mephenytoin-4-dehydroxylase)
22      H-SDAFVPFSI-NH2           Cytochrome P450
23      H-RTGALVLSRG-NH2          Leukosialin precursor (leucocyte sialoglycoprotein)
24      H-ESSFQPEAGF-NH2
25      H-GGPISFSSSRS-NH2         Myosin heavy chain (non-muscular, type B)
26      H-IEKEAAQLQ-NH2
27      H-SDTSLTWNSVK-NH2         Lactotransferrin precursor (lactoferrin)
28      H-FSLMTLRNFGM-NH2         Cytochrome P450 (mephenytoin-4-dehydroxylase)
29      H-FQLPLDKGN-NH2
30      H-SFSIIGDFQN-NH2          Von Willebrand factor precursor
31      H-ALEKLDGTEVN-NH2         Pre-mRNA splicing factor SRP75
32      H-VAEIQGHAG-NH2           Bone morphogenetic protein 4 precursor
33      H-LHNILGVETGGPG-NH2
34      H-SASDLTWDNLK-NH2         Serotransferrin precursor
35      H-ISELFTLE-NH2            DNA polymerase ε (catalytic subunit A)
36      H-GGPGGPPGPLMEQMG-NH2     RNA-binding protein
37      H-ALEWLGADRNE-NH2         FKBP-rapamycin associated protein
38      H-DVDFEGTDEPIF-NH2        Hypothetical protein fragment
39      H-GATPGKALVATP-NH2        Nucleolin
40      H-MKGPEVMAFIEQ-NH2        Tyrosine protein kinase
41      H-ESKLERTPQKNVQ-NH2       Subunit of activator 1
42      H-PRATTPKTVRS-NH2         Histone H1T
43      H-MKTRQNKDSMSMRS-NH2      Ribosomal protein S18
44      H-ERMGHHDDYYSRLR-NH2      Auxiliary factor of small nuclear ribonucleoprotein
45      H-LVEHYPEFI-NH2
46      H-FSVSTLRNLGL-NH2
47      H-SGTLIKIFQAS-NH2
48      H-FAEQDAKEEANKAM-NH2      FK506-binding protein (peptidyl-prolyl cis-trans isomerase)
49      H-ETKHGGHKN-NH2           Aspartyl/asparaginyl β-hydroxylase
50      H-LQEALSKAA-NH2           Hypothetical protein
51      H-PWTAVDTSVD-NH2          Cation-independent mannose-6-phosphate receptor precursor
52      H-VIQYFASIAAIGDR-NH2      Myosin heavy chain (isoform α, cardiac muscle)
53      H-GYIGVVNRSQKDID-NH2      Dynamin-1
54      H-QLWDVAHSVKEKF-NH2       Glycogen synthase (liver)
55      H-GNETSFVPSRRSG-NH2       Myosin heavy chain (smooth muscle isoform)
56      H-TDIFGVEETAI-NH2         Protein 114 associated with the spliceosome
57      H-QTAVTAVEKPADK-NH2
58      H-EVAMDDHKLSLDEL-NH2      α-2 chain of ATPase transporting sodium and potassium
59      H-TVTPAKAVTTP-NH2         Nucleolin
60      H-MEAETGSSVET-NH2
61      H-QWAQFKIQWNQRW-NH2
62      H-DYSKGITVTKND-NH2        Hypothetical protein
63      H-VIQYLAHVASSHK-NH2       Myosin heavy chain (non-muscular, type B)
64      H-YPKPQQFFGLM-NH2
65      H-EPPKEETAQLTGPEA-NH2     Large proline-rich BAT-2 protein
66      H-YMHGHRAPG-NH2
67      H-MGKWHVG-NH2
68      H-SSSHSLSHK-NH2
69      H-HVGLLRIK-NH2

[0106] Pharmacological Properties of the Identified Peptides

[0107] 1) Methods

[0108] Binding Assay Alternative Protocol by Skatron

[0109] The experimental conditions are identical for all the binding assay protocols except for the fact that:

[0110] i) the volume R in operation is 200 μl;

[0111] ii) the test is carried out in 96-well Falcon plates;

[0112] iii) the reaction is directly stopped by filtration on filterMat (ref. 11734) Skatron and the radioactivity associated with the filtrate is evaluated:

[0113] either directly with a γ counter for the peptides labelled with iodine 125;

[0114] or with a γ counter in the presence of 5 ml of scintillation liquid.

[0115] Membrane Preparations of Guinea Pig Brains and Binding Assay

[0116] The guinea pigs are sacrificed by rupturing their cervical vertebrae and decapitated, and the brains are removed very rapidly and placed into a sucrose-Tris-HCl buffer (0.32 M sucrose, 5 mM Tris-HCl, 0.1 g/l bacitracin) at 4° C. (about 10 ml of buffer per brain). The brains are then homogenized using a Potter device and the homogenate is incubated for 30 minutes at 37° C. under agitation, then centrifuged twice for 35 minutes at 100,000×g at 4° C. The protein content of the membrane preparations is evaluated according to the Bradford method (BioRad, according to the manufacturer's protocol).

[0117] Binding of Labelled Agonists to the Guinea Pig Brain Membrane Preparation

[0118] The membranes are placed in a buffer (50 mM of Tris-HCl, 5 mM of MgCl₂ and 0.1 g/l of bacitracin) at the desired protein concentration (iodinated CCK binding: 0.1 mg of protein/ml; iodinated gastrin binding: 0.5 mg of protein/ml). They are then incubated in the presence of a labelled ligand (about 10 pM for the iodinated CCK₈, 20 pM for the iodinated gastrin₁₃) in a total volume of 500 μl for 50-80 minutes at 25° C., in the presence or absence of cold agonists. The reaction is stopped with 3 ml of additional BSA buffer (20 g/l) at 4° C., the tubes are centrifuged at 10,000×g, the supernatant is drawn off and the radioactivity associated with the precipitate is evaluated with a γ counter.

[0119] Binding of Radiolabelled Agonists to Jurkat T Cells

[0120] Culture Conditions of the Cells of the Jurkat T Human Lymphocytic Line

[0121] The Jurkat cells are cultured in an RPMI 1640 medium supplemented with foetal calf serum (10% volume/volume) and antibiotics (50 U/ml of penicillin and 50 μg/ml of streptomycin) in a humid incubator at 37° C. under an atmosphere of 5% CO₂ in air.

[0122] Binding Test

[0123] The cells are harvested by centrifugation (514×g, 5 minutes), then washed twice in a standard medium containing: 98 mM of NaCl; 6 mM of KCl; 2.5 mM of NaH₂PO₄; 1.5 mM of CaCl₂; 1 mM of MgCl₂; 5 mM of Na-pyruvate; 5 mM of Na-fumarate; 5 mM of Na-glutamate; 2 mM of glutamine; 11.5 mM of glucose; 24.5 mM of Hepes (N-[2-hydroxyethyl]piperazine-N′-[2-ethane sulfonic acid]); 0.5 g/l of bacitracin; 0.1 g/l of soya bean trypsin inhibitor; pH 7.4.

[0124] Binding experiments with agonists labelled with iodine 125 (about 50 picomolar) are carried out at 37° C. under agitation for 45-60 minutes in a final volume of 0.5 ml of standard medium containing 2×10⁶ cells (4×10⁶ cells/ml), in the presence or absence of competitors. Non-specific binding is evaluated in the presence of a micromolar concentration of cold homologous peptide.

[0125] Membrane Preparations of Different Rat Tissues and Organs

[0126] 1) The rats are sacrificed by rupturing their cervical vertebrae, and the different tissues are removed and washed several times in a large volume of 0.9% (9 g/l) NaCl solution.

[0127] The following steps are carried out at 4° C.

[0128] 2) The tissues and organs are transferred, still separately, into a sucrose buffer containing:

[0129] 0.25 M of sucrose

[0130] 25 mM of TRIS-HCl, pH 7.4.

[0131] 0.2 mM of PMSF

[0132] 0.1 mM of 1,10-phenanthroline

[0133] 100 μg/ml of STI

[0134] They are then cut up carefully with fine scissors and crushed using an Ultra-Turrax apparatus.

[0135] 3) The tissues and organs are finally ground using a Potter apparatus (a minimum of ten times in each rotational direction at a minimum of 1000 rpm).

[0136] 4) The ground material is centrifuged at 500×g for 10 minutes at 4° C. The supernatants are saved, and the pellets are resuspended in sucrose buffer and centrifuged again under the same conditions.

[0137] 5) The 2 supernatants are then mixed and centrifuged at 100,000×g for 30 minutes at 4° C.

[0138] 6) The supernatants are discarded and the pellets placed again in a sucrose buffer for centrifuging a second time at 100,000×g for 30 minutes at 4° C.

[0139] 7) The supernatants are discarded, and the pellets taken up in a binding buffer containing 50 mM of TRIS-HCl (pH 7.4) buffer and 5 mM of MgCl₂.

[0140] 8) The protein content of the membrane suspensions is evaluated using the Bradford assay (BioRad, according to the manufacturer's instructions). The suspensions are aliquoted and stored in liquid nitrogen at −80° C.

[0141] The binding tests are carried out following a protocol identical to that used with the guinea pig brain membranes.

[0142] Studies of the Binding of Labelled Agonists on the HeLa Cell Line

[0143] The cells are cultured in a DMEM medium supplemented with 10% foetal calf serum, 0.5% of antibiotics (penicillin/streptomycin) and 0.5% of glutamine.

[0144] 24 to 48 hours before experimentation, the cells are replated in 24-well plates at approximately 100,000 cells per well in 1 ml of medium.

[0145] The binding studies are carried out with a BindH buffer:

Component      Molarity   Mass concentration
NaCl           98 mM      5.72 g/l
KCl            6 mM       0.45 g/l
NaH₂PO₄        2.5 mM     0.3 g/l
Na-pyruvate    5 mM       0.55 g/l
Na-fumarate    5 mM       0.58 g/l
Na-glutamate   5 mM       0.84 g/l
CaCl₂          1.5 mM     0.22 g/l
MgCl₂          1 mM       0.20 g/l
HEPES          25 mM      6.07 g/l
Glucose        11.5 mM    2.07 g/l
Glutamine      2 mM       0.22 g/l
STI                       0.1 g/l

[0146] For the experiment, the medium is removed and the cells washed twice with 1 ml of buffer. The reaction medium (500 μl) containing the labelled agonist is added into each well in the presence or absence of the cold homologous agonist and the plates are incubated for one hour at 37° C.

[0147] The reaction is stopped by removal of the reaction volume and the wells are rinsed twice with 1 ml of BindH with 20% BSA.

[0148] The cells are lysed with 500 μl of 1 N sodium hydroxide for 20 minutes at ambient temperature and the radioactivity associated with the lysate is evaluated using a gamma counter.

[0149] 2) Results

[0150] The thus identified peptides were tested according to the protocols described above.

[0151] The results obtained for peptides 1 to 6 are set out in the following table.

Example  Binding test with                  Results
1        guinea pig brain membranes         positive (IC₅₀ = 100 μM)
         different rat organ membranes      negative
         HeLa cells                         positive (IC₅₀ = 100 μM)
         Jurkat T cells                     negative
2        guinea pig brain membranes         positive (IC₅₀ = 100 μM)
         different rat organ membranes      positive (IC₅₀ = 100 μM)
         HeLa cells                         positive (IC₅₀ = 100 μM)
         Jurkat T cells                     positive (IC₅₀ = 100 μM)
3        guinea pig brain membranes         positive (IC₅₀ = 100 μM)
         membranes of the brain/liver       positive (IC₅₀ = 100 μM)
         and different rat organs           positive (IC₅₀ = 2 μM)
         HeLa cells                         positive (IC₅₀ = 50 μM)
         Jurkat T cells
4        guinea pig brain membranes         positive (IC₅₀ = 100 μM)
         different rat organ membranes      negative
         HeLa cells                         positive (IC₅₀ = 100 μM)
         Jurkat T cells                     positive (IC₅₀ = 100 μM)
5        guinea pig brain membranes         positive (IC₅₀ = 100 μM)
         rat brain membranes                positive (IC₅₀ = 90 μM)
         different rat organ membranes      positive (IC₅₀ = 100 μM)
         Jurkat T cells                     negative
6        guinea pig brain membranes         positive (IC₅₀ = 100 μM)
         rat brain membranes                positive (IC₅₀ = 90 μM)
         different rat organ membranes      positive (IC₅₀ = 100 μM)
         Jurkat T cells                     negative

[0152] Besides peptides 1 to 6, the following peptides were obtained and tested on guinea pig brain membranes:

Example  IC₅₀ = 100 μM    Example  IC₅₀ = 100 μM
7        0.43             16       0.08
8        4.2              17       0.68
9        0.22             18       0.8
10       0.03             19       0.7
11       0.27             20       0.35
12       0.02             21       0.16
13       0.09             22       2.8
14       0.86             23       1.3
15       0.36             24       0.23
25       0.31             48       1.47
26       0.7 to 22        49       0.33
27       0.68             50       1
28       0.4              51       0.14
29       0.01             52       7
30       27               53       0.01 to 1
31       0.02 to 2        54       0.3
32       0.2              55       0.004 to 1
33       0.1              56       0.73
34       1                57       0.06
35       5.6              58       0.86
36       2.9              59       0.41
37       0.8              60       <0.1 to 4
38       0.5              61       1.11
39       0.4              62       0.67
40       0.13             63       0.02
41       0.25             64       0.001
42       0.56             65       4
43       0.05             66       0.002
44       0.1              67       0.16
45       0.7              68       0.025
46       0.1              69       0.037
47       0.3

[0153] Although only preferred embodiments are described and claimed here, it will be appreciated that modifications can be made to the preferred embodiments without departing from the spirit and intended scope of the invention.

1. Method for identifying putative peptides of a given function from among nucleotide or peptide sequences of unknown function comprising the following steps: (i) obtaining a polynucleotide or polypeptide database; (ii) screening said database for the presence of a combination of nucleotides or amino acids indicative of the peptide of given function; (iii) identifying the polynucleotide or polypeptide sequences which comprise the combination of nucleotides or amino acids indicative of the peptide of given function.
 2. Method for identifying a precursor of a peptide comprising an amidated C-terminal end comprising the following steps: (i) obtaining a polynucleotide or polypeptide database; (ii) screening said database for the presence of a combination of nucleotides or amino acids indicative of the precursor of the peptide comprising the amidated C-terminal end; (iii) identifying the polynucleotide or polypeptide sequences which comprise the combination of nucleotides or amino acids indicative of the precursor of the peptide comprising the amidated C-terminal end.
 3. Method according to claim 2, characterised in that said database comprises at least one of the polynucleotide sequences and polypeptide sequences corresponding to said polynucleotide sequences.
 4. Method according to claim 3, characterised in that said database comprises both polynucleotide sequences and polypeptide sequences corresponding to said polynucleotide sequences.
 5. Method according to claim 3, characterised in that said database comprises polynucleotide sequences and accession numbers for said polynucleotide sequences.
 6. Method according to claim 3, characterised in that said database comprises polypeptide sequences.
 7. Method according to claim 4, characterised in that said polypeptide sequences are obtained by translating the polynucleotide sequences in said database.
 8. Method according to claim 7, characterised in that three different polypeptide sequences are obtained, corresponding to translation of three different reading frames of said polynucleotide sequences.
 9. Method according to claim 3, characterised in that said database further comprises annotational information relating to said polypeptide or polynucleotide sequences.
 10. Method according to claim 9, characterised in that said annotational information comprises at least one piece of information chosen from the origin, the source, the characteristics and the references for said sequences.
 11. Method according to claim 2, characterised in that said step of screening said database further comprises the steps of locating the AUG start codon in a polynucleotide sequence and verifying that no stop codon is present between said AUG start codon and the combination of nucleotides indicative of the precursor of the peptide comprising the amidated C-terminal end.
 12. Method according to claim 2, further comprising the step, after the step of identifying polynucleotide or polypeptide sequences, of comparing said identified sequences with sequences of known biological function and selecting identified sequences from this group whose biological function is unknown.
 13. Method according to claim 12, further comprising the step of comparing the similarity of said selected sequences of unknown biological activity with sequences of known function and, if no similar sequence is found, selecting said sequence of unknown biological activity for further investigation.
 14. Method according to claim 12, further comprising the step of comparing the similarity of said selected sequences of unknown biological activity with sequences of known function and, if a similar sequence is found, selecting said sequence of unknown biological activity as a candidate sequence exhibiting the putative function of the known similar sequence.
 15. Method according to claim 2, further comprising the step of obtaining the identified polypeptide sequence and evaluation of the properties of said polypeptide sequence.
 16. Method according to claim 2, characterised in that said identified polypeptide sequence is H-GQDSIEPVPGQK-NH₂; H-YARVQVVA-NH₂; H-YFKIDNVKKARVQVVA-NH₂; H-PLEPSGG-NH₂; H-ELGRGPGPPLPERGA-NH₂; or H-YERNRQAAAANPENSRGK-NH₂.
 17. Method according to claim 2, characterised in that the combination of nucleotides comprises the sequence Y1-Y2-Y3-Y4-Y5, in which Y1 is a nucleotide sequence of 1 to 12 nucleotides or is suppressed, Y2 is a codon for Gly, Y3 and Y4 independently are codons for Arg or Lys and Y5 is a nucleotide sequence of 1 to 21 nucleotides or Y5 is suppressed. 